Optimizing Performance in Highly Utilized Multicores with Intelligent Prefetching

نویسنده

  • MUNEEB KHAN
چکیده

Khan, M. 2016. Optimizing Performance in Highly Utilized Multicores with Intelligent Prefetching. Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1335. 54 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-554-9450-6. Modern processors apply sophisticated techniques, such as deep cache hierarchies and hardware prefetching, to increase performance. Such complex hardware structures have helped improve performance in general, however, their full potential is not realized as software often utilizes the memory hierarchy inefficiently. Performance can be improved further by ensuring careful interaction between software and hardware. Performance can typically improve by increasing the cache utilization and by conserving the DRAM bandwidth, i.e., retaining more useful data in the caches and lowering data requests to the DRAM. One way to achieve this is to conserve space across the cache hierarchy and increase opportunity for temporal reuse of cached data. Similarly, conserving the DRAM bandwidth is essential for performance in highly utilized multicores, as it can easily become a critical resource. When multiple cores are active and the per-core share of DRAM bandwidth shrinks, its efficient utilization plays an important role in improving the overall performance. Together the cache hierarchy and the DRAM bandwidth play a significant role in defining the overall performance in multicores. Based on deep insight from memory behavior modeling of software, this thesis explores five software-only methods to analyze and increase performance in multicores. The underlying philosophy that drives these techniques is to increase cache utilization and conserve DRAM bandwidth by 1) focusing on making data prefetching more accurate, and 2) lowering the miss rate in the cache hierarchy either by preserving useful data longer by cache-bypassing the less useful data or via code size compaction using compiler options. First, we show how microarchitecture-independent memory access profiles can be used to analyze the Instruction Cache performance of software. We use this information in a compiler pass to recompile application phases (with large Instruction cache miss rate) for smaller code size in an effort to improve the application Instruction Cache behavior. Second, we demonstrate how a resourceefficient software prefetching method can be combined with hardware prefetching to improve performance in multicores when running software that exhibits irregular memory access patterns. Third, we show that hardware prefetching on high performance commodity multicores is sub-optimal and demonstrate how a resource-efficient software-only prefetching method can perform better in fully utilized multicores. Fourth, we present an adaptive prefetching approach that dynamically combines software and hardware prefetching in a runtime system to improve performance in highly utilized multicores. Finally, in the fifth work we develop a method to predict per-core prefetching configurations that deliver near-optimal overall multicore performance. These software techniques enable us to tap greater performance in multicores (up to 50%), without requiring more processing resources.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Perf-Insight: A Simple, Scalable Approach to Optimal Data Prefetching in Multicores

Aggressive hardware prefetching is extremely beneficial for single-threaded performance but can lead to significant slowdowns in multicore processors due to oversubscription of off-chip bandwidth and shared cache capacity. This work addresses this problem by adjusting prefetching on a per-application basis to improve overall system performance. Unfortunately, an exhaustive search of all possibl...

متن کامل

Automatic Compiler-Inserted Prefetching for Pointer-Based Applications

|As the disparity between processor and memory speeds continues to grow, memory latency is becoming an increasingly important performance bottleneck. While software-controlled prefetching is an attractive technique for tolerating this latency, its success has been limited thus far to array-based numeric codes. In this paper, we expand the scope of automatic compiler-inserted prefetching to also...

متن کامل

A Radon-based Convolutional Neural Network for Medical Image Retrieval

Image classification and retrieval systems have gained more attention because of easier access to high-tech medical imaging. However, the lack of availability of large-scaled balanced labelled data in medicine is still a challenge. Simplicity, practicality, efficiency, and effectiveness are the main targets in medical domain. To achieve these goals, Radon transformation, which is a well-known t...

متن کامل

Phase-based tuning for better utilized performance-asymmetric multicores

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii CHAPTER

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016